Prediction of Health Outcomes Based on Various Environmental and Socioeconomic Factors

Advanced Data Science

Yunyang Zhong
12-14-2021
library(tidyverse)         # for graphing and data cleaning
library(tidymodels)        # for modeling
library(stacks)            # for stacking models
library(naniar)            # for analyzing missing values
library(vip)               # for variable importance plots
library(usemodels)         # for tidymodels suggestions
library(xgboost)           # for boosting
library(doParallel)        # for parallel processing
library(lubridate)         # for dates
library(moderndive)        # for King County housing data
library(patchwork)         # for combining plots nicely
library(rmarkdown)         # for paged tables
library(shiny)             # for shiny dashboard
library(bslib)             # for theming
library(usmap)             # for mapping
library(Metrics)           # for calculating metrics
theme_set(theme_minimal()) # Lisa's favorite theme

Introduction

This project builds models to examine the relationship between various environmental and socioeconomic factors and health outcomes in California counties. A large portion of the project involved compiling the data from different publicly available sources, and some decisions about what to include and exclude were driven by what data we could find. We ultimately focused our analysis on California counties because of the abundance of publicly available data, and we included the years 2017, 2018, and 2019 because a variety of data had been collected over that period, all of it uninfluenced by the pandemic.

We built models predicting four different health outcomes - asthma, breast cancer, heart disease, and Alzheimer’s disease - but the dataset we created contains more than twenty other possible health outcomes standardized on a per-person basis, making it easy for anyone to create their own models.
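For instance, a model of any other standardized outcome would reuse the pipeline shown later in this post with only the outcome column swapped. In this sketch, diabetes_avg is a hypothetical column name standing in for whichever outcome you pick:

pollution %>% 
  mutate(log_outcome = log(diabetes_avg, base = 10)) %>%  # swap in your outcome
  select(-diabetes_avg)                                   # drop the raw version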

Data Sources

Data for this project was collected from a wide variety of sources, most of which are available through the California Open Data Portal. Below is a list of the data sources we used, each of which was standardized by population:
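Whatever the source, the standardization step itself was essentially division by county population. As a rough illustration (health_raw, county_population, asthma_er_visits, and population are placeholder names, not the actual source columns):

health_raw %>% 
  left_join(county_population, by = c("County", "Year")) %>%  # attach county population
  mutate(asthma_er_avg = asthma_er_visits / population)       # raw counts -> per-person rate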

Data Cleaning and Exploration

Because we combined data from 5 sources spanning more than 10 datasets, our data cleaning process was complex and tedious. Instead of including all of it in this blog, here is a link to our GitHub repository, where all datasets and data cleaning files can be found.

Use the interactive graphic below to explore the relationships between the predictors and outcomes in our dataset. If it doesn’t load, access it here.

knitr::include_app("https://miarothberg.shinyapps.io/Final_Project/", height = "600px")

Literature

Asthma

There is an extensive body of research documenting the relationship between air pollution and asthma, which was the initial inspiration for this project. Increases in concentrations of particulate matter, sulfur dioxide, nitrogen dioxide, and ozone have been documented as directly correlated with increases in asthma-related emergency room visits. A general lack of air quality data led us to pursue other possible correlations, especially those provided by the Healthy Places Index dataset.

Breast Cancer

Recent research has also revealed a positive correlation between those same air pollutants and breast cancer incidence rates. Variables such as education rate, income, and employment are also known to influence breast cancer prevalence, so they were included in our model. Breast cancer is also known to have a genetic component, so we did not expect to explain all variation in cancer rates by environmental and socioeconomic factors alone.

Alzheimer’s Disease

Long-term exposure to air pollutants has been shown to contribute to cognitive decline in a variety of ways, including by increasing the risk of Alzheimer’s disease and hastening its onset. Maternal exposure to these pollutants has also been associated with increased risk of autism and ADHD, but we were unable to find data documenting those diagnoses.

Heart Disease

Finally, research using an EPA dataset has established that long-term exposure to particulate matter and nitrogen oxides at levels close to the National Ambient Air Quality Standards can prematurely age blood vessels and contribute to a more rapid buildup of calcium in the coronary artery, leading to heart disease.

Modeling Asthma ER Visits

pollution <- read.csv("data_standardized.csv")
pollution %>% 
  select(where(is.numeric)) %>% 
  pivot_longer(cols = everything(),
               names_to = "variable", 
               values_to = "value") %>% 
  ggplot(aes(x = value)) +
  geom_histogram(bins = 30) +
  facet_wrap(vars(variable), 
             scales = "free",
             nrow = 5)
pollution %>% 
  select(asthma_er_avg) %>% 
  ggplot(aes(x = asthma_er_avg)) +
  geom_histogram()
# Replace NAs with the column median (numeric columns only, since
# median() is undefined for character columns)
for (i in 1:ncol(pollution)) {
  if (is.numeric(pollution[, i])) {
    col_median <- median(pollution[, i], na.rm = TRUE)
    pollution[is.na(pollution[, i]), i] <- col_median
  }
}

Based on the histogram above, we can see that asthma_er_avg is right-skewed. Also, to get rid of NA values without losing observations, we decided to replace all NAs with the median value of each column.
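For reference, an equivalent and more idiomatic tidyverse version of the imputation loop above would be a single mutate() with tidyr’s replace_na():

pollution <- pollution %>% 
  mutate(across(where(is.numeric), 
                ~ replace_na(.x, median(.x, na.rm = TRUE))))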

pollution_mod <- pollution %>%
  mutate(log_asthma = log(asthma_er_avg, base = 10)) %>% 
  select(-asthma_er_avg) %>% 
  mutate(across(where(is.character), as.factor)) %>% 
  add_n_miss() %>% 
  filter(n_miss_all == 0) %>% 
  select(-n_miss_all)

set.seed(456)
pollution_split <- initial_split(pollution_mod, prop = .75)
pollution_training <- training(pollution_split)
pollution_testing <- testing(pollution_split)

To correct the right-skewed shape, we performed a log (base 10) transformation before using the variable as the outcome in our models. We also converted all character variables to factors and confirmed there were no remaining NAs.

The dataset has 153 observations of 64 variables in total. 75% (114 observations) went into the training set and 25% (39 observations) into the testing set.
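A quick sanity check confirms those counts (results shown as comments):

dim(pollution_mod)        # 153  64
nrow(pollution_training)  # 114
nrow(pollution_testing)   # 39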

Lasso Model

In modeling our data, we decided to look at five different machine learning techniques to determine the best model for predicting the health outcomes from environmental and socioeconomic factors.

The lasso model performs regularization and variable selection in determining the best predictors of the outcome. To preprocess the data for the lasso model, we removed some variables, normalized the remaining numeric predictors, and created dummy variables for the nominal ones.
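For reference, with a penalty $\lambda$ the lasso chooses coefficients by minimizing the penalized least-squares criterion

$$\hat{\beta} = \arg\min_{\beta} \; \sum_{i=1}^{n} \left(y_i - x_i^{\top}\beta\right)^2 + \lambda \sum_{j=1}^{p} |\beta_j|,$$

and larger values of $\lambda$ shrink more coefficients exactly to zero, which is what drops predictors from the model.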

pollution_recipe <- recipe(log_asthma ~ ., data = pollution_training) %>% 
  step_rm(County,
          Year,
          All.causes..total.,
          Alzheimer.s.disease,
          Malignant.neoplasms,
          Chronic.lower.respiratory.diseases,
          Diabetes.mellitus,
          Assault..homicide.,
          Diseases.of.heart,
          Essential.hypertension.and.hypertensive.renal.disease,
          Accidents..unintentional.injuries.,
          Chronic.liver.disease.and.cirrhosis,
          Nephritis..nephrotic.syndrome.and.nephrosis,
          Parkinson.s.disease,
          Influenza.and.pneumonia,
          Cerebrovascular.diseases,
          Intentional.self.harm..suicide.,
          FIPS,
          cancer_incidence_rate,
          asthma_deaths,
          Bladder_Surgery_Ct,
          Brain_Surgery_Ct,
          Breast_Surgery_Ct,
          Colon_Surgery_Ct,
          Esophagus_Surgery_Ct,
          Liver_Surgery_Ct,
          Lung_Surgery_Ct,
          Pancreas_Surgery_Ct,
          Prostate_Surgery_Ct,
          Rectum_Surgery_Ct,
          Stomach_Surgery_Ct) %>% 
  step_normalize(all_predictors(), 
                 -all_nominal()) %>% 
  step_dummy(all_nominal(), 
             -all_outcomes())

We then applied it to our training data.

pollution_recipe %>% 
  prep(pollution_training) %>%
  juice()
# A tibble: 114 x 33
   annual_mean_pm25 annual_mean_ozone annual_mean_pb annual_mean_pm10
              <dbl>             <dbl>          <dbl>            <dbl>
 1           0.769             1.91           -0.438           1.20  
 2          -0.791            -0.0862         -0.176          -0.426 
 3           0.0376           -0.153          -0.176           0.136 
 4           0.130            -0.118          -0.176           0.0213
 5          -0.851             0.0999         -0.969          -0.949 
 6          -0.625            -0.107          -0.860          -0.442 
 7           0.861            -0.0905         -0.176          -0.150 
 8           1.70              0.645          -0.176          -1.08  
 9          -1.25             -1.03           -0.176          -0.150 
10           3.09             -0.0905         -0.176          -0.150 
# ... with 104 more rows, and 29 more variables:
#   annual_mean_co <dbl>, annual_mean_no2 <dbl>,
#   annual_mean_so2 <dbl>, hpi2score <dbl>, economic <dbl>,
#   education <dbl>, housing <dbl>, healthcareaccess <dbl>,
#   neighborhood <dbl>, pollution <dbl>, transportation <dbl>,
#   social <dbl>, insured <dbl>, uncrowded <dbl>,
#   homeownership <dbl>, automobile <dbl>, commute <dbl>, ...

To define the model, we chose linear regression with the glmnet engine, set mixture = 1 (pure lasso), and marked the penalty parameter for tuning.

pollution_lasso_mod <- 
  linear_reg(mixture = 1) %>% 
  set_engine("glmnet") %>% 
  set_args(penalty = tune()) %>% 
  set_mode("regression")

We then created a workflow with our recipe and defined model.

pollution_lasso_wf <- 
  workflow() %>% 
  add_recipe(pollution_recipe) %>% 
  add_model(pollution_lasso_mod)
penalty_grid <- grid_regular(penalty(),
                             levels = 20)
penalty_grid 
# A tibble: 20 x 1
    penalty
      <dbl>
 1 1   e-10
 2 3.36e-10
 3 1.13e- 9
 4 3.79e- 9
 5 1.27e- 8
 6 4.28e- 8
 7 1.44e- 7
 8 4.83e- 7
 9 1.62e- 6
10 5.46e- 6
11 1.83e- 5
12 6.16e- 5
13 2.07e- 4
14 6.95e- 4
15 2.34e- 3
16 7.85e- 3
17 2.64e- 2
18 8.86e- 2
19 2.98e- 1
20 1   e+ 0

We then set up 5-fold cross-validation and tuned the model over the penalty grid.

ctrl_grid <- control_stack_grid()

set.seed(456)
pollution_cv <- vfold_cv(pollution_training, v = 5)

pollution_lasso_tune <- 
  pollution_lasso_wf %>% 
  tune_grid(
    resamples = pollution_cv,
    grid = penalty_grid,
    control = ctrl_grid
    )

pollution_lasso_tune
# Tuning results
# 5-fold cross-validation 
# A tibble: 5 x 5
  splits          id    .metrics          .notes       .predictions   
  <list>          <chr> <list>            <list>       <list>         
1 <split [91/23]> Fold1 <tibble [40 x 5]> <tibble [1 ~ <tibble [460 x~
2 <split [91/23]> Fold2 <tibble [40 x 5]> <tibble [1 ~ <tibble [460 x~
3 <split [91/23]> Fold3 <tibble [40 x 5]> <tibble [1 ~ <tibble [460 x~
4 <split [91/23]> Fold4 <tibble [40 x 5]> <tibble [1 ~ <tibble [460 x~
5 <split [92/22]> Fold5 <tibble [40 x 5]> <tibble [1 ~ <tibble [440 x~

We then looked at the RMSE of the model output across folds to determine the best model to use on the testing data.

pollution_lasso_tune %>% 
  select(id, .metrics) %>% 
  unnest(.metrics) %>% 
  filter(.metric == "rmse")
# A tibble: 100 x 6
   id     penalty .metric .estimator .estimate .config              
   <chr>    <dbl> <chr>   <chr>          <dbl> <chr>                
 1 Fold1 1   e-10 rmse    standard       0.232 Preprocessor1_Model01
 2 Fold1 3.36e-10 rmse    standard       0.232 Preprocessor1_Model02
 3 Fold1 1.13e- 9 rmse    standard       0.232 Preprocessor1_Model03
 4 Fold1 3.79e- 9 rmse    standard       0.232 Preprocessor1_Model04
 5 Fold1 1.27e- 8 rmse    standard       0.232 Preprocessor1_Model05
 6 Fold1 4.28e- 8 rmse    standard       0.232 Preprocessor1_Model06
 7 Fold1 1.44e- 7 rmse    standard       0.232 Preprocessor1_Model07
 8 Fold1 4.83e- 7 rmse    standard       0.232 Preprocessor1_Model08
 9 Fold1 1.62e- 6 rmse    standard       0.232 Preprocessor1_Model09
10 Fold1 5.46e- 6 rmse    standard       0.232 Preprocessor1_Model10
# ... with 90 more rows

We then visualized the RMSE as a function of the penalty value.

pollution_lasso_tune %>% 
  collect_metrics() %>% 
  filter(.metric == "rmse") %>% 
  ggplot(aes(x = penalty, y = mean)) +
  geom_point() +
  geom_line() +
  scale_x_log10(
   breaks = scales::trans_breaks("log10", function(x) 10^x),
   labels = scales::trans_format("log10",scales::math_format(10^.x))) +
  labs(x = "penalty", y = "rmse")

We used the show_best() and select_best() functions to identify the best model, and select_by_one_std_err() to pick the simplest penalty within one standard error of the best.

pollution_lasso_tune %>% 
  show_best(metric = "rmse")
# A tibble: 5 x 7
   penalty .metric .estimator  mean     n std_err .config             
     <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
1 0.00785  rmse    standard   0.282     5  0.0127 Preprocessor1_Model~
2 0.00234  rmse    standard   0.302     5  0.0147 Preprocessor1_Model~
3 0.0264   rmse    standard   0.328     5  0.0154 Preprocessor1_Model~
4 0.000695 rmse    standard   0.336     5  0.0344 Preprocessor1_Model~
5 0.000207 rmse    standard   0.346     5  0.0440 Preprocessor1_Model~
best_param <- pollution_lasso_tune %>% 
  select_best(metric = "rmse")
best_param
# A tibble: 1 x 2
  penalty .config              
    <dbl> <chr>                
1 0.00785 Preprocessor1_Model16
one_se_param <- pollution_lasso_tune %>% 
  select_by_one_std_err(metric = "rmse", desc(penalty))
one_se_param
# A tibble: 1 x 9
  penalty .metric .estimator  mean     n std_err .config  .best .bound
    <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>    <dbl>  <dbl>
1 0.00785 rmse    standard   0.282     5  0.0127 Preproc~ 0.282  0.295

We then finalized the workflow using the selected lasso model.

pollution_lasso_final_wf <- pollution_lasso_wf %>% 
  finalize_workflow(one_se_param)
pollution_lasso_final_wf
== Workflow ==========================================================
Preprocessor: Recipe
Model: linear_reg()

-- Preprocessor ------------------------------------------------------
3 Recipe Steps

* step_rm()
* step_normalize()
* step_dummy()

-- Model -------------------------------------------------------------
Linear Regression Model Specification (regression)

Main Arguments:
  penalty = 0.00784759970351461
  mixture = 1

Computational engine: glmnet 

Using this finalized workflow, we fit the model to the training data and examined the coefficient estimates.

pollution_lasso_final_mod <- pollution_lasso_final_wf %>% 
  fit(data = pollution_training)

pollution_lasso_final_mod %>% 
  pull_workflow_fit() %>% 
  tidy() 
# A tibble: 33 x 3
   term              estimate penalty
   <chr>                <dbl>   <dbl>
 1 (Intercept)        2.97    0.00785
 2 annual_mean_pm25   0       0.00785
 3 annual_mean_ozone  0.00160 0.00785
 4 annual_mean_pb     0.0210  0.00785
 5 annual_mean_pm10  -0.161   0.00785
 6 annual_mean_co     0.0539  0.00785
 7 annual_mean_no2   -0.132   0.00785
 8 annual_mean_so2    0.0652  0.00785
 9 hpi2score          0       0.00785
10 economic           0       0.00785
# ... with 23 more rows

We then used last_fit() to fit the final model on the training data and evaluate it on the testing data.

pollution_lasso_test <- pollution_lasso_final_wf %>% 
  last_fit(pollution_split)

# Metrics for model applied to test data
pollution_lasso_test %>% 
  collect_metrics()
# A tibble: 2 x 4
  .metric .estimator .estimate .config             
  <chr>   <chr>          <dbl> <chr>               
1 rmse    standard       0.267 Preprocessor1_Model1
2 rsq     standard       0.846 Preprocessor1_Model1

XGBoost Model

First, we used the use_xgboost() function, which gives recommendations on how to build our recipe, workflow, and model specification.

use_xgboost(log_asthma ~ ., data = pollution_training)
xgboost_recipe <- 
  recipe(formula = log_asthma ~ ., data = pollution_training) %>% 
  step_novel(all_nominal(), -all_outcomes()) %>% 
  step_dummy(all_nominal(), -all_outcomes(), one_hot = TRUE) %>% 
  step_zv(all_predictors()) 

xgboost_spec <- 
  boost_tree(trees = tune(), min_n = tune(), tree_depth = tune(), learn_rate = tune(), 
    loss_reduction = tune(), sample_size = tune()) %>% 
  set_mode("regression") %>% 
  set_engine("xgboost") 

xgboost_workflow <- 
  workflow() %>% 
  add_recipe(xgboost_recipe) %>% 
  add_model(xgboost_spec) 

set.seed(90624)
xgboost_tune <-
  tune_grid(xgboost_workflow, resamples = stop("add your rsample object"), grid = stop("add number of candidate points"))

Based on output from the use_xgboost() function and our previous recipes, we created our own recipe.

boost_recipe <- 
  recipe(formula = log_asthma ~ ., data = pollution_training) %>% 
  step_rm(County,
          Year,
          All.causes..total.,
          Alzheimer.s.disease,
          Malignant.neoplasms,
          Chronic.lower.respiratory.diseases,
          Diabetes.mellitus,
          Assault..homicide.,
          Diseases.of.heart,
          Essential.hypertension.and.hypertensive.renal.disease,
          Accidents..unintentional.injuries.,
          Chronic.liver.disease.and.cirrhosis,
          Nephritis..nephrotic.syndrome.and.nephrosis,
          Parkinson.s.disease,
          Influenza.and.pneumonia,
          Cerebrovascular.diseases,
          Intentional.self.harm..suicide.,
          FIPS,
          cancer_incidence_rate,
          asthma_deaths,
          Bladder_Surgery_Ct,
          Brain_Surgery_Ct,
          Breast_Surgery_Ct,
          Colon_Surgery_Ct,
          Esophagus_Surgery_Ct,
          Liver_Surgery_Ct,
          Lung_Surgery_Ct,
          Pancreas_Surgery_Ct,
          Prostate_Surgery_Ct,
          Rectum_Surgery_Ct,
          Stomach_Surgery_Ct) %>% 
  step_novel(all_nominal_predictors()) %>% 
  step_dummy(all_nominal_predictors(), one_hot = TRUE)  %>% 
  step_zv(all_predictors()) %>% 
  step_normalize(all_predictors(), 
                 -all_nominal())

Next, we specified the model. learn_rate is the parameter we decided to tune, and we created a grid of 10 candidate values to try:
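For context, boosting builds the prediction sequentially: each new tree $h_m$ is added at only a fraction of its full contribution, with that fraction set by the learning rate $\lambda$,

$$F_m(x) = F_{m-1}(x) + \lambda \, h_m(x), \qquad m = 1, \dots, T,$$

where $T$ is the total number of trees. These are the quantities the comments in the specification below refer to.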

boost_spec <- boost_tree(
  trees = 1000,            # number of trees, T in the update above
  tree_depth = 2,          # max depth of each tree
  min_n = 5,               # min points required for a node to be split further
  loss_reduction = 10^-5,  # min loss reduction to keep splitting - smaller allows more splits
  sample_size = 1,         # proportion of training data to use
  learn_rate = tune(),     # lambda in the update above
  stop_iter = 50           # iterations without improvement before stopping early
) %>% 
  set_engine("xgboost", colsample_bytree = 1) %>% 
  set_mode("regression")
boost_grid <- grid_regular(learn_rate(),
                           levels = 10)
boost_grid
# A tibble: 10 x 1
     learn_rate
          <dbl>
 1 0.0000000001
 2 0.000000001 
 3 0.00000001  
 4 0.0000001   
 5 0.000001    
 6 0.00001     
 7 0.0001      
 8 0.001       
 9 0.01        
10 0.1         

We then put the recipe and model specification into a workflow.

boost_wf <- workflow() %>% 
  add_recipe(boost_recipe) %>%
  add_model(boost_spec) 

The next step is to train these models. registerDoParallel() was used to speed up the process.

set.seed(456)
registerDoParallel()

boost_tune <- boost_wf %>% 
  tune_grid(
    resamples = pollution_cv,
    grid = boost_grid,
    control = ctrl_grid
  )
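If you want finer control over the worker pool, registration can also be done with an explicit cluster and a tidy shutdown. A sketch of that alternative (not what we ran):

cl <- parallel::makePSOCKcluster(parallel::detectCores() - 1)  # leave one core free
registerDoParallel(cl)
# ... run tune_grid() as above ...
parallel::stopCluster(cl)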

Here is a table summarizing the results. We can see that larger learning rates actually seem to do better.

collect_metrics(boost_tune)
# A tibble: 20 x 7
     learn_rate .metric .estimator     mean     n std_err .config     
          <dbl> <chr>   <chr>         <dbl> <int>   <dbl> <chr>       
 1 0.0000000001 rmse    standard     2.57       5  0.0333 Preprocesso~
 2 0.0000000001 rsq     standard   NaN          0 NA      Preprocesso~
 3 0.000000001  rmse    standard     2.57       5  0.0333 Preprocesso~
 4 0.000000001  rsq     standard   NaN          0 NA      Preprocesso~
 5 0.00000001   rmse    standard     2.57       5  0.0333 Preprocesso~
 6 0.00000001   rsq     standard     0.0589     1 NA      Preprocesso~
 7 0.0000001    rmse    standard     2.57       5  0.0333 Preprocesso~
 8 0.0000001    rsq     standard     0.741      5  0.0486 Preprocesso~
 9 0.000001     rmse    standard     2.57       5  0.0333 Preprocesso~
10 0.000001     rsq     standard     0.751      5  0.0511 Preprocesso~
11 0.00001      rmse    standard     2.55       5  0.0333 Preprocesso~
12 0.00001      rsq     standard     0.746      5  0.0603 Preprocesso~
13 0.0001       rmse    standard     2.34       5  0.0340 Preprocesso~
14 0.0001       rsq     standard     0.764      5  0.0594 Preprocesso~
15 0.001        rmse    standard     1.09       5  0.0382 Preprocesso~
16 0.001        rsq     standard     0.800      5  0.0569 Preprocesso~
17 0.01         rmse    standard     0.217      5  0.0187 Preprocesso~
18 0.01         rsq     standard     0.915      5  0.0377 Preprocesso~
19 0.1          rmse    standard     0.134      5  0.0130 Preprocesso~
20 0.1          rsq     standard     0.959      5  0.0155 Preprocesso~

We also plotted the RMSE to visually support our conclusion above.

collect_metrics(boost_tune) %>% 
  filter(.metric == "rmse") %>% 
  ggplot(aes(x = learn_rate, y = mean)) +
  geom_point() +
  geom_line() +
  scale_x_log10() +
  labs(y = "rmse") +
  theme_minimal()

We then selected the best learning rate parameter and finalized the model.

best_lr <- select_best(boost_tune, "rmse")
best_lr
# A tibble: 1 x 2
  learn_rate .config              
       <dbl> <chr>                
1        0.1 Preprocessor1_Model10
# finalize workflow
final_boost_wf <- finalize_workflow(
  boost_wf,
  best_lr
)

# fit final
final_boost <- final_boost_wf %>% 
  fit(data = pollution_training)

This graph tells us which predictors are the most important. It seems that supermarkets, annual_mean_no2, and homeownership have the largest influence on asthma ER visits.

final_boost %>% 
  pull_workflow_fit() %>%
  vip(geom = "col")

The final model was also fit to the testing data. We can see the testing RMSE is about 0.054, somewhat larger than the training RMSE, as expected for a model evaluated on data it was not fit to.

# Use model on test data
test_preds <- pollution_testing %>% 
  bind_cols(predict(final_boost, new_data = pollution_testing)) 

# Compute test rmse
test_preds %>% 
  summarize(rmse = sqrt(mean((log_asthma - .pred)^2))) %>% 
  pull(rmse)
[1] 0.05400519

The graph below visualizes the results. We can conclude that the predicted values are very close to the actual values.

# Graph results
test_preds %>% 
  ggplot(aes(x = log_asthma,
             y = .pred)) +
  geom_point(alpha = .5, 
             size = .5) +
  geom_smooth(se = FALSE) +
  geom_abline(slope = 1, 
              intercept = 0, 
              color = "darkred") +
  labs(x = "Actual log(asthma)", 
       y = "Predicted log(asthma)")

Random Forest Model

We set up the random forest model with its recipe and preprocessing steps.

pollution_rfrecipe <- recipe(log_asthma ~ ., data = pollution_training) %>% 
  step_rm(County,
          Year,
          All.causes..total.,
          Alzheimer.s.disease,
          Malignant.neoplasms,
          Chronic.lower.respiratory.diseases,
          Diabetes.mellitus,
          Assault..homicide.,
          Diseases.of.heart,
          Essential.hypertension.and.hypertensive.renal.disease,
          Accidents..unintentional.injuries.,
          Chronic.liver.disease.and.cirrhosis,
          Nephritis..nephrotic.syndrome.and.nephrosis,
          Parkinson.s.disease,
          Influenza.and.pneumonia,
          Cerebrovascular.diseases,
          Intentional.self.harm..suicide.,
          FIPS,
          cancer_incidence_rate,
          asthma_deaths,
          Bladder_Surgery_Ct,
          Brain_Surgery_Ct,
          Breast_Surgery_Ct,
          Colon_Surgery_Ct,
          Esophagus_Surgery_Ct,
          Liver_Surgery_Ct,
          Lung_Surgery_Ct,
          Pancreas_Surgery_Ct,
          Prostate_Surgery_Ct,
          Rectum_Surgery_Ct,
          Stomach_Surgery_Ct) %>% 
  step_normalize(all_predictors(), 
                 -all_nominal()) %>% 
  step_dummy(all_nominal(), 
             -all_outcomes())

We then applied the recipe to the training data.

pollution_rfrecipe %>% 
  prep(pollution_training) %>%
  juice()
# A tibble: 114 x 33
   annual_mean_pm25 annual_mean_ozone annual_mean_pb annual_mean_pm10
              <dbl>             <dbl>          <dbl>            <dbl>
 1           0.769             1.91           -0.438           1.20  
 2          -0.791            -0.0862         -0.176          -0.426 
 3           0.0376           -0.153          -0.176           0.136 
 4           0.130            -0.118          -0.176           0.0213
 5          -0.851             0.0999         -0.969          -0.949 
 6          -0.625            -0.107          -0.860          -0.442 
 7           0.861            -0.0905         -0.176          -0.150 
 8           1.70              0.645          -0.176          -1.08  
 9          -1.25             -1.03           -0.176          -0.150 
10           3.09             -0.0905         -0.176          -0.150 
# ... with 104 more rows, and 29 more variables:
#   annual_mean_co <dbl>, annual_mean_no2 <dbl>,
#   annual_mean_so2 <dbl>, hpi2score <dbl>, economic <dbl>,
#   education <dbl>, housing <dbl>, healthcareaccess <dbl>,
#   neighborhood <dbl>, pollution <dbl>, transportation <dbl>,
#   social <dbl>, insured <dbl>, uncrowded <dbl>,
#   homeownership <dbl>, automobile <dbl>, commute <dbl>, ...

We created the model using regression as the mode and ranger as the engine.

ranger_pollution <- 
  rand_forest(mtry = tune(), 
              min_n = tune(), 
              trees = 200) %>% 
  set_mode("regression") %>% 
  set_engine("ranger")

We then set up the workflow with the recipe and random forest model made above.

pollution_rfworkflow <- 
  workflow() %>% 
  add_recipe(pollution_rfrecipe) %>% 
  add_model(ranger_pollution) 

We set up the tuning grid over mtry and min_n.

rf_penalty_grid <- grid_regular(finalize(mtry(),
                                         pollution_training %>%
                                           select(-log_asthma)),
                                min_n(),
                                levels = 3)

We tuned the model and determined the parameters giving the best RMSE.

set.seed(456) 

pollution_cv <- vfold_cv(pollution_training, v = 5) 

pollution_rfTUNE <- 
  pollution_rfworkflow %>% 
  tune_grid(
    resamples = pollution_cv,
    grid = rf_penalty_grid,
    control = ctrl_grid
  )

best_rmse <- 
  pollution_rfTUNE %>%
  select_best(metric = "rmse")

best_rmse
# A tibble: 1 x 3
   mtry min_n .config             
  <int> <int> <chr>               
1    32     2 Preprocessor1_Model2

We finalized the workflow with those parameters and fit the model to the training data.

pol_rfFinal <- pollution_rfworkflow %>%
  finalize_workflow(best_rmse) %>%
  fit(data = pollution_training)

We then found the RMSE on the testing data.

predictions_rf <- pollution_testing %>% 
  select(log_asthma) %>% 
  bind_cols(predict(pol_rfFinal, new_data = pollution_testing))

predictions_rf %>%
  summarize(testing_rmse = sqrt(mean((log_asthma - .pred)^2)))
  testing_rmse
1    0.08511188

We also plotted the predicted values against the actual values:
predictions_rf %>% 
  ggplot(aes(x = log_asthma,
             y = .pred)) +
  geom_point(alpha = .5, 
             size = .5) +
  geom_smooth(se = FALSE) +
  geom_abline(slope = 1, 
              intercept = 0, 
              color = "darkred") +
  labs(x = "Actual log(asthma)", 
       y = "Predicted log(asthma)")

Stacking

After we had our set of candidate models (9 random forest, 20 lasso, and 10 boost), we started stacking. We followed the process laid out on the stacks webpage.

First, we created the stack. As shown by the number of columns, it removed some of the candidate models, likely because their predictions were too similar to those of other candidates. The set of candidate models now has 9 random forest, 9 lasso, and 9 boost models.

pollution_stack <- 
  stacks() %>% 
  add_candidates(boost_tune) %>%
  add_candidates(pollution_lasso_tune) %>% 
  add_candidates(pollution_rfTUNE)

We can look at the predictions from the candidate models in a tibble. Most of the predictions are very close to the actual values, except the first few boost models.

as_tibble(pollution_stack)
# A tibble: 114 x 28
   log_asthma boost_tune_1_01 boost_tune_1_03 boost_tune_1_04
        <dbl>           <dbl>           <dbl>           <dbl>
 1       3.65             0.5             0.5           0.500
 2       2.06             0.5             0.5           0.500
 3       2.06             0.5             0.5           0.500
 4       2.82             0.5             0.5           0.500
 5       3.05             0.5             0.5           0.500
 6       3.98             0.5             0.5           0.500
 7       1.95             0.5             0.5           0.500
 8       2.29             0.5             0.5           0.500
 9       2.17             0.5             0.5           0.500
10       1.73             0.5             0.5           0.500
# ... with 104 more rows, and 24 more variables:
#   boost_tune_1_05 <dbl>, boost_tune_1_06 <dbl>,
#   boost_tune_1_07 <dbl>, boost_tune_1_08 <dbl>,
#   boost_tune_1_09 <dbl>, boost_tune_1_10 <dbl>,
#   pollution_lasso_tune_1_01 <dbl>, pollution_lasso_tune_1_13 <dbl>,
#   pollution_lasso_tune_1_14 <dbl>, pollution_lasso_tune_1_15 <dbl>,
#   pollution_lasso_tune_1_16 <dbl>, ...

We then blended the predictions from the candidate models together to form an even better overall prediction using the blend_predictions() function. Doing this with our models, we see that only 4 members have non-zero coefficients: 1 random forest, 1 boost, and 2 lasso models:

set.seed(456)

pollution_blend <- 
  pollution_stack %>% 
  blend_predictions()

pollution_blend
# A tibble: 4 x 3
  member                    type        weight
  <chr>                     <chr>        <dbl>
1 pollution_rfTUNE_1_2      rand_forest 0.727 
2 boost_tune_1_10           boost_tree  0.282 
3 pollution_lasso_tune_1_18 linear_reg  0.0699
4 pollution_lasso_tune_1_19 linear_reg  0.0127

Here is the RMSE for the various penalty parameters:

pollution_blend$metrics %>% 
  filter(.metric == "rmse")
# A tibble: 6 x 8
   penalty mixture .metric .estimator  mean     n std_err .config     
     <dbl>   <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>       
1 0.000001       1 rmse    standard   0.104    25 0.00232 Preprocesso~
2 0.00001        1 rmse    standard   0.104    25 0.00232 Preprocesso~
3 0.0001         1 rmse    standard   0.104    25 0.00232 Preprocesso~
4 0.001          1 rmse    standard   0.104    25 0.00234 Preprocesso~
5 0.01           1 rmse    standard   0.105    25 0.00260 Preprocesso~
6 0.1            1 rmse    standard   0.152    25 0.00470 Preprocesso~

We can examine some plots to see if we need to adjust the penalty parameter at all. This set of three plots with penalty on the x-axis shows that we seem to have captured the smallest RMSE.

autoplot(pollution_blend)

This plot presents the blending weights for the top ensemble members. The random forest has the highest weight, more than twice that of the boost model, which has the second highest; the two lasso models have low weights.

autoplot(pollution_blend, type = "weights")

Finally, we fit the candidate models with non-zero stacking coefficients to the full training data using the fit_members() function. The numeric values of the blending weights for the top ensemble members are printed again below.

pollution_final_stack <- pollution_blend %>% 
  fit_members()

pollution_final_stack
# A tibble: 4 x 3
  member                    type        weight
  <chr>                     <chr>        <dbl>
1 pollution_rfTUNE_1_2      rand_forest 0.727 
2 boost_tune_1_10           boost_tree  0.282 
3 pollution_lasso_tune_1_18 linear_reg  0.0699
4 pollution_lasso_tune_1_19 linear_reg  0.0127

Here is a plot comparing predicted and actual values after applying the stack to the testing data. We can see the two lines are almost the same!

pollution_final_stack %>% 
  predict(new_data = pollution_testing) %>% 
  bind_cols(pollution_testing) %>% 
  ggplot(aes(x = log_asthma, 
             y = .pred)) +
  geom_point(alpha = .5, 
             size = .5) +
  geom_smooth(se = FALSE) +
  geom_abline(slope = 1, 
              intercept = 0, 
              color = "darkred") +
  labs(x = "Actual log(asthma)", 
       y = "Predicted log(asthma)")

data_rmse <- pollution_final_stack %>% 
  predict(new_data = pollution_testing) %>% 
  bind_cols(pollution_testing) 

rmse(data_rmse$log_asthma, data_rmse$.pred)
[1] 0.080368

The final plot, in which the blue line of predicted values nearly matches the red line of ideal values, and the low RMSE of about 0.08 together show that this model was fairly effective.

Modeling Other Variables

We created models in the same way for three other variables, which you can read about in the links below:

Overall, the Alzheimer’s disease death rate model was most effective at predicting the outcome variable and the breast cancer surgery rate model was the least effective. This suggests that many of the factors that worsen Alzheimer’s disease were present in our dataset but that it was missing factors that contribute to breast cancer. This makes sense, as breast cancer has been shown to be hereditary.

Limitations, Troubleshooting, and Potential Repercussions

One major limitation of this project was the lack of available data. While we initially knew we wanted to focus on the relationship between air pollutants and various health outcomes, we chose to focus solely on California because that was where data was available. We could only find limited data on air pollutants, so we expanded our dataset to include the socioeconomic data we could find. Additionally, many of the air pollutant variables had missing values, and we found that our dataset had only 13 complete cases on which to base our models. To deal with this, we filled in all the NA values with the median value of each variable. This likely weakens the models, though they were still fairly strong.

If we only look at the visualizations displayed in the Shiny app, it is possible to come to conclusions that do not make sense in reality. For example, there seems to be a negative relationship between asthma ER visits and mean annual PM2.5, despite an extensive body of research and logic indicating the opposite relationship. Because we have limited data from a single region and many other factors play important roles in asthma, we should not draw conclusions from examining the plots alone.

It is also important to keep in mind that these models are not predicting health outcomes for any given individual; they are predicting the per person rate of the health outcome in each county. The models should not be used to determine insurance rates, but they could be used as the basis for policies that, for example, lower air pollutant limits or advocate for more accessible education.

Outcomes/Future Directions

The models created in this project were able to predict the chosen outcome variables with a fair amount of accuracy. A next step would be to focus on more interpretable machine learning, specifically creating an interactive ceteris paribus profile to easily visualize the exact impact of each variable on the prediction.
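As a rough sketch of that direction, the DALEX package can wrap the fitted stack in an explainer and compute a ceteris paribus profile for a single county. The predict wrapper here is our assumption about how the stack’s predictions plug into DALEX, not tested code:

library(DALEX)

stack_explainer <- explain(
  model = pollution_final_stack,
  data = pollution_testing %>% select(-log_asthma),
  y = pollution_testing$log_asthma,
  # stacks returns a tibble with a .pred column; DALEX wants a numeric vector
  predict_function = function(model, newdata) predict(model, new_data = newdata)$.pred,
  label = "stacked ensemble")

cp <- predict_profile(stack_explainer, 
                      new_observation = pollution_testing[1, ])
plot(cp, variables = "annual_mean_no2")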

The dataset we created contains numerous other health outcomes, each of which could be the subject of its own model with relatively little additional work. We chose to model the outcomes that we did because of previous research tying them to environmental factors, but it would be interesting to look for connections for the other variables as well.

Additionally, it would be useful to add more predictors to the dataset, especially the race and age distributions of each county. These are currently missing from the dataset we built but are known risk factors for many health outcomes, so adding them may help reveal potentially confounding variables.